To get some hands-on experience, we apply simple linear regression in an exercise. For this purpose we load the students data set. You may download the students.csv file here for exploration purposes or simply follow the code lines below. Let us first import the packages we need for this exercise.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline 

Instead of downloading the file, we may also load the data set directly into our workspace with pandas' read_csv function.

In [4]:
students = pd.read_csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

We consider an example and sample 24 random rows, each representing one student. Note that we set a random state to ensure reproducibility: each time we run the method, the same rows are sampled. Furthermore, we select the columns containing the variables of interest and assign them to the variables x and y. Then we calculate the mean of each variable using NumPy's mean function.

In [18]:
n = 24
students_sample = students.sample(n, random_state=14)
df = students_sample[['height', 'weight']]
x = df['height']
y = df['weight']
x_bar = np.mean(x)
y_bar = np.mean(y)

To get a first idea of our data, we create a simple scatter plot using the Matplotlib library.

In [19]:
fig = plt.figure(figsize=(10,8))
plt.scatter(x, y)
plt.xlabel(x.name)
plt.ylabel(y.name)
plt.show()

The visual inspection confirms our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height, the individual student tends to have a higher weight.
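
We may also back up the visual impression with a single number, the Pearson correlation coefficient. The following lines are a minimal sketch using NumPy's corrcoef function; this check is not part of the analysis that follows.

# correlation matrix of height and weight; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))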

Parameter estimation

Solving for $a$ and $b$ analytically

As shown in the previous section, we may calculate the parameters $a$ and $b$ of a simple linear model analytically. Recall the equation of a linear model fitted from sample data,

$$y = a + b x + e \text{,}$$

the estimator for the slope $b$,

$$b = \frac{\sum_{i=1}^n ((x_i- \bar x) (y_i-\bar y))}{\sum_{i=1}^n (x_i-\bar x)^2} = \frac{cov(x,y)}{var(x)}\text{,}$$

and the estimator for the intercept $a$:

$$a = \bar y -b \bar x$$

Let us try this in Python. We start by calculating the slope $b$.

In [20]:
b = np.sum((x-x_bar)*(y-y_bar))/np.sum((x-x_bar)**2)
print(b)
0.7494309085725562
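
As a cross-check, the same value follows from the covariance/variance form of the formula above. The sketch below is only a verification step; note that np.cov and np.var must use the same degrees of freedom (ddof) so that the $(n-1)$ factors cancel.

# b = cov(x, y) / var(x), both computed with ddof=1
b_check = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
print(b_check)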

The slope $b$ is approximately 0.75. Knowing $b$, we can calculate $a$.

In [21]:
a = y_bar - b*x_bar
print(a)
-55.146696199478015

The intercept $a$ of the regression model is approximately -55.15. Thus, we can write down the regression model:

$$\text{weight} = a + b \times \text{height}$$

Now, based on that equation, we may determine the weight of a student given their height. Here are some sample predictions of the weight of students with heights of $156\ cm$, $178\ cm$, and $192\ cm$. Note that we use f-strings, which are only available in Python 3.6 and later.

In [31]:
for height in [156, 178, 192]:
    weight = np.round(a + b * height, 2)
    print(f"weight_{height} = {a} + {b} x {height} cm = {weight} kg")
weight_156 = -55.146696199478015 + 0.7494309085725562 x 156 cm = 61.76 kg
weight_178 = -55.146696199478015 + 0.7494309085725562 x 178 cm = 78.25 kg
weight_192 = -55.146696199478015 + 0.7494309085725562 x 192 cm = 88.74 kg

The statsmodels package provides a class for ordinary least squares models called OLS(). To get the intercept $a$, we need to provide a column of ones along with the predictor variable $\text{height}$. You can find some documentation here.

In [32]:
x_train = sm.add_constant(x)
model = sm.OLS(y, x_train)
fit = model.fit()
fit.params
Out[32]:
const    -55.146696
height     0.749431
dtype: float64
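
Since params is a pandas Series, the individual estimates can also be accessed by their index labels, for example:

# intercept and slope estimated by statsmodels
print(fit.params['const'])
print(fit.params['height'])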

Model properties

As we can see, the calculations of the parameters match. We get a summary of the model properties using the following line.

In [33]:
fit.summary()
Out[33]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                 weight   R-squared:                       0.891
Model:                            OLS   Adj. R-squared:                  0.886
Method:                 Least Squares   F-statistic:                     178.9
Date:                Tue, 20 Dec 2022   Prob (F-statistic):           4.81e-12
Time:                        20:58:51   Log-Likelihood:                -54.440
No. Observations:                  24   AIC:                             112.9
Df Residuals:                      22   BIC:                             115.2
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -55.1467      9.568     -5.764      0.000     -74.990     -35.304
height         0.7494      0.056     13.376      0.000       0.633       0.866
==============================================================================
Omnibus:                        3.844   Durbin-Watson:                   2.262
Prob(Omnibus):                  0.146   Jarque-Bera (JB):                2.935
Skew:                           0.855   Prob(JB):                        0.230
Kurtosis:                       2.900   Cond. No.                     3.28e+03
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.28e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The RegressionResults class, of which our fit object is an instance, provides many more methods and attributes.
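
A few examples of commonly used attributes (a small, non-exhaustive selection):

# goodness of fit and parameter statistics of the fitted model
print(fit.rsquared)      # coefficient of determination
print(fit.rsquared_adj)  # adjusted R-squared
print(fit.bse)           # standard errors of the parameter estimates
print(fit.pvalues)       # p-values of the t-tests on the parameters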

We can get the confidence intervals of our parameters $a$ (here called const) and $b$ (here called height) with the following method.

In [34]:
fit.conf_int()
Out[34]:
                 0          1
const   -74.989570 -35.303822
height    0.633237   0.865625
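
As a sanity check, the interval for the slope can be reproduced from the point estimate, its standard error and the appropriate t-quantile. The sketch below uses scipy.stats, which is not imported above.

from scipy import stats

se_b = fit.bse['height']                      # standard error of the slope
t_crit = stats.t.ppf(0.975, df=fit.df_resid)  # two-sided 95% critical value
print(fit.params['height'] - t_crit * se_b,
      fit.params['height'] + t_crit * se_b)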

The Pearson residuals are available through the resid_pearson attribute.

In [35]:
residuals = fit.resid_pearson

Ideally, the sum of our residuals $(\sum e)$ is close to zero.

In [36]:
print(np.sum(residuals))
9.36584143573782e-13

This is the case and strengthens our confidence in the model. Let us check for heteroscedasticity in the residuals. To do so, we plot the residuals and inspect them visually. There is a neat function residplot() in the seaborn library.

In [37]:
fig = plt.figure(figsize=(8, 6))
predicted = fit.predict(x_train)
sns.residplot(x=predicted, y=residuals)
plt.xlabel("predicted")
plt.ylabel("residuals")
plt.show()

The residuals scatter randomly around the zero line. We do not see any sign of heteroscedasticity in the data.
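
Besides the visual check, a formal test for heteroscedasticity, such as the Breusch-Pagan test, can be applied. The following sketch goes beyond the original visual inspection; a large p-value indicates no evidence against homoscedasticity.

from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: H0 = residual variance does not depend on the regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, x_train)
print(lm_pvalue)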

Finally, we visualize our model. Again, we use the seaborn library.

We set a few options to get the plot we want. ci controls the confidence interval; we use the value 95 to get a 95% interval. We set truncate to False so that the regression line extends to the limits of the x-axis instead of stopping at the data range. ax places the plot on a specific Matplotlib Axes object, and label defines the string used in the legend.

For the matplotlib plots, we specify colors, markers, and labels for the legend.

get_prediction() is another particularly useful method. It returns the predicted value $\hat y_i$ for each $x_i$.

Additionally, we may provide new data to predict $\hat y_i$ for any given $x_i$. Be aware that the new data must be accompanied by a column of constants. We add a column of ones.

We also add the so-called prediction bands. These include the uncertainty about future observations; they capture the majority of the observed points and do not collapse to a line as the number of observations increases. They are available via the summary_frame() method of the prediction object as the columns obs_ci_upper and obs_ci_lower. Since these values belong to the (unsorted) predicted points, we sort them before drawing the line plot.

In [38]:
new_data = sm.add_constant([165, 172, 183])

pred = fit.get_prediction()
pred_new = fit.get_prediction(new_data)

fig, ax = plt.subplots(figsize=(8, 6))
sns.regplot(x=x, y=y, ci=95, truncate=False, ax=ax, label="data")
ax.scatter(x, pred.predicted_mean, color='red', marker="^", label='predicted values')
ax.scatter(new_data[:, 1], pred_new.predicted_mean, color='orange', marker="s", label="new data (predicted)")
ax.plot(sorted(x), sorted(pred.summary_frame()["obs_ci_upper"]), color="k", label="prediction bands")
ax.plot(sorted(x), sorted(pred.summary_frame()["obs_ci_lower"]), color="k")
ax.legend()
plt.show()
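
The numerical values behind the plot can be inspected with the summary_frame() method of the prediction objects; besides the predicted mean, it contains the confidence interval of the mean (mean_ci_lower, mean_ci_upper) and the wider prediction interval (obs_ci_lower, obs_ci_upper). A short usage example for the new heights:

# predictions and 95% intervals for the heights 165, 172 and 183 cm
print(pred_new.summary_frame(alpha=0.05))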

Have another look at the model summary. We will discuss some of these parameters in the next section, where we take a look at model diagnostics.

In [30]:
fit.summary()
Out[30]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                 weight   R-squared:                       0.891
Model:                            OLS   Adj. R-squared:                  0.886
Method:                 Least Squares   F-statistic:                     178.9
Date:                Tue, 20 Dec 2022   Prob (F-statistic):           4.81e-12
Time:                        20:58:15   Log-Likelihood:                -54.440
No. Observations:                  24   AIC:                             112.9
Df Residuals:                      22   BIC:                             115.2
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -55.1467      9.568     -5.764      0.000     -74.990     -35.304
height         0.7494      0.056     13.376      0.000       0.633       0.866
==============================================================================
Omnibus:                        3.844   Durbin-Watson:                   2.262
Prob(Omnibus):                  0.146   Jarque-Bera (JB):                2.935
Skew:                           0.855   Prob(JB):                        0.230
Kurtosis:                       2.900   Cond. No.                     3.28e+03
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.28e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.